Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate optimality has conserved properties. Via machine learning, we confirm optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce an ensemble inference approach that integrates results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.


Description of Additional Supplementary Files
File Name: Supplementary Data 2.

Description: The workflow lists, our benchmarking results, and the Kruskal-Wallis test results. supp2.Tab1-supp2.Tab6 give the workflow list, the mean and median performance for each metric (e.g., mean_pauc001 refers to the average pAUC(0.01) score across the benchmark datasets), the ranking positions based on single metrics (e.g., rank_mean_pauc001 refers to the ranking of workflows based on their mean_pauc001 values), the average ranks across the five metrics (e.g., avg_rank_mean, calculated by averaging rank_mean_pauc001, rank_mean_pauc005, etc.), and the final rank positions (the last column in each table, obtained by ranking the workflows by their avg_rank_mean value). supp2.Tab7 shows the Kruskal-Wallis test results of the top 30 workflows under each of the settings.
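The rank aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the workflow names, the scores, and the third metric name (`mean_gmean`) are made up, higher scores are assumed to be better for every metric, and tied scores are assumed to share the average rank.

```python
from statistics import mean

def rank_descending(scores):
    """Return 1-based ranks (1 = highest score); ties share the average rank (assumption)."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of workflows tied with order[i].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def final_ranking(metric_table):
    """metric_table maps a metric name to {workflow: mean score across datasets}."""
    per_metric = {m: rank_descending(s) for m, s in metric_table.items()}
    workflows = list(next(iter(metric_table.values())))
    # avg_rank_mean: average of the single-metric ranks for each workflow.
    avg_rank_mean = {w: mean(r[w] for r in per_metric.values()) for w in workflows}
    # Final position: workflows ordered by avg_rank_mean (lower is better).
    ordered = sorted(workflows, key=avg_rank_mean.get)
    return avg_rank_mean, ordered

# Hypothetical toy scores for three workflows and three metrics.
metrics = {
    "mean_pauc001": {"wf_a": 0.90, "wf_b": 0.80, "wf_c": 0.70},
    "mean_pauc005": {"wf_a": 0.85, "wf_b": 0.90, "wf_c": 0.60},
    "mean_gmean": {"wf_a": 0.80, "wf_b": 0.60, "wf_c": 0.70},
}
avg_rank_mean, ordered = final_ranking(metrics)
print(ordered)
```

Averaging ranks rather than raw scores keeps metrics with different scales from dominating the final position.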
File Name: Supplementary Data 3.

Description: Workflow performance levels and classification results with CatBoost.
In each sheet, three tables show the workflow performance level classification results under a specific quantification setting. For example, in the sheet "FG_DDA", supp3.Tab1 shows the details of the workflows and their label and class information. We assigned workflows ranked in the top 5% to the "H" class with a label of 1. Workflows ranked between 5% and 25% belong to the "RH" class with a label of 2. Similarly, workflows ranked between 25% and 50% belong to the "RL" class with a label of 3, and the remaining bottom 50% of workflows belong to the "L" class with a label of 0. supp3.Tab2 shows the performance level classification results of the 10-fold cross-validation. We used several performance indicators, such as the F1 score and MCC; their means (mean) and standard deviations (std) are shown in the last two rows. supp3.Tab3 shows the feature importances.
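The class and label assignment can be written as a small function. This is a sketch under stated assumptions: the function name is hypothetical, ranks are taken to be 1-based final rank positions, and each boundary (5%, 25%, 50%) is assumed to fall in the better class, which the text does not specify.

```python
def performance_class(rank, n_workflows):
    """Map a 1-based final rank to (class, label) using the thresholds in the text.

    Integer arithmetic avoids floating-point surprises at the boundaries.
    Boundary handling (<=) is an assumption, not taken from the source.
    """
    if rank * 100 <= 5 * n_workflows:
        return "H", 1    # top 5%
    if rank * 100 <= 25 * n_workflows:
        return "RH", 2   # 5%-25%
    if rank * 100 <= 50 * n_workflows:
        return "RL", 3   # 25%-50%
    return "L", 0        # bottom 50%

# Label 100 hypothetical workflows by rank and count the class sizes.
labels = [performance_class(r, 100) for r in range(1, 101)]
sizes = {c: sum(1 for cls, _ in labels if cls == c) for c in ("H", "RH", "RL", "L")}
print(sizes)
```

Note that the numeric labels are not ordered by performance: the lowest class, "L", carries the label 0.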
File Name: Supplementary Data 4.

Description: The results of using a linear model to check the interactions between the workflow step options and the workflow ranking.
In each sheet, the results of the linear regression model for evaluating the interaction between the predictive variables and the response values, together with the extracted ANOVA table, are shown. For example, in the sheet "FG_DDA", the linear model results for the FG_DDA workflows are shown in supp4.Tab1, and the extracted ANOVA table is presented in supp4.Tab2.
File Name: Supplementary Data 5.
Description: Frequent pattern mining results.
In each sheet, the frequent pattern mining results are shown for a specific setting. For example, in the sheet "FG_DDA", supp5.Tab1 gives the frequent patterns with support ratios higher than 0.1 mined from the "H" workflows under setting FG_DDA, while supp5.Tab2 gives the frequent patterns with support ratios higher than 0.1 mined from the "L" workflows under setting FG_DDA.
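To illustrate what a support ratio threshold means here, the sketch below counts how often each combination of workflow options occurs in a set of workflows and keeps those above the threshold. It uses brute-force enumeration for clarity and is not necessarily the mining algorithm used in the study. The step names, option names, and the `max_size` cap are hypothetical.

```python
from itertools import combinations

def frequent_patterns(workflows, min_support=0.1, max_size=3):
    """Return {pattern: support ratio} for option combinations above min_support.

    workflows: list of dicts mapping a workflow step to the chosen option.
    A pattern is a tuple of (step, option) pairs; its support ratio is the
    fraction of workflows that contain every pair in the pattern.
    """
    n = len(workflows)
    counts = {}
    for wf in workflows:
        items = sorted(wf.items())  # fixed order so equal patterns share one key
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] = counts.get(combo, 0) + 1
    return {p: c / n for p, c in counts.items() if c / n > min_support}

# Hypothetical "H" workflows; the option names are illustrative only.
h_workflows = [
    {"normalization": "none", "mvi": "MinProb", "dea": "limma"},
    {"normalization": "none", "mvi": "SeqKNN", "dea": "limma"},
    {"normalization": "center.median", "mvi": "MinProb", "dea": "limma"},
    {"normalization": "none", "mvi": "MinProb", "dea": "ROTS"},
]
patterns = frequent_patterns(h_workflows, min_support=0.3)
for pattern, support in sorted(patterns.items(), key=lambda kv: -kv[1]):
    print(support, pattern)
```

Running the same function on the "L" workflows and comparing the two outputs shows which option combinations are specific to high-performing workflows.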
File Name: Supplementary Data 6.

File Name: Supplementary Data 1.
Description: The LODOCV results for measuring the generalizability of our benchmarking under different settings. supp1.Tab1-supp1.Tab6 give the LODOCV results under the settings FG_DDA, MQ_DDA, DIANN_DIA, spt_DIA, FG_TMT, and MQ_TMT, respectively. In each table, the first column (dataset) shows the dataset name. The second column (contrast_index) is the contrast index for the dataset shown in the first column. The third column (mean_spearman) shows the LODOCV performance measured by the Spearman correlation coefficient, where a workflow's performance is the mean performance across all the testing datasets during the benchmarking. Similarly, the fourth column (median_spearman) presents the Spearman correlation coefficients of the LODOCVs, but with the workflow's performance calculated as the median performance across the testing datasets. The last two columns (mean_pearson and median_pearson) give the corresponding mean-based and median-based LODOCV results measured by Pearson's correlation instead of the Spearman correlation. Below the tables, we show the mean and median correlation coefficients.
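One plausible reading of the mean_spearman column can be sketched as follows. It takes LODOCV to be leave-one-dataset-out cross-validation and assumes that, for each held-out dataset, workflows are scored by their mean performance on the remaining datasets and then correlated with their performance on the held-out one. That reading, the function names, and the toy scores are all assumptions, not the authors' implementation.

```python
from statistics import mean

def average_ranks(values):
    """1-based ascending ranks; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))

def lodocv(scores):
    """scores maps a dataset name to a list with one score per workflow."""
    results = {}
    for held_out, held_out_scores in scores.items():
        rest = [s for d, s in scores.items() if d != held_out]
        # Mean performance of each workflow across the remaining datasets.
        mean_perf = [mean(col) for col in zip(*rest)]
        results[held_out] = spearman(mean_perf, held_out_scores)
    return results

# Hypothetical scores of four workflows on three datasets.
scores = {
    "ds1": [0.90, 0.70, 0.50, 0.30],
    "ds2": [0.80, 0.75, 0.40, 0.35],
    "ds3": [0.85, 0.60, 0.65, 0.20],
}
results = lodocv(scores)
print(results)
```

Replacing `mean(col)` with `statistics.median(col)` gives the median-based variant, and calling `pearson` directly in place of `spearman` gives the Pearson-based columns.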